[FEAT] Enable buffered iteration on plans #2566
Conversation
Codecov Report

Attention: Patch coverage is

```
@@           Coverage Diff           @@
##             main    #2566   +/-   ##
=======================================
  Coverage        ?   64.02%
=======================================
  Files           ?      951
  Lines           ?   107920
  Branches        ?        0
=======================================
  Hits            ?    69101
  Misses          ?    38819
  Partials        ?        0
=======================================
```
Looks good! Some minor nits that would be good to fix if the docstrings are public facing
Co-authored-by: Desmond Cheong <[email protected]>
Together with #2566, closes #2561.

This PR changes the way the PyRunner performs resource accounting. Instead of updating the number of CPUs, GPUs, and memory used only when futures are retrieved, we now do this just before each task completes. These variables are protected with a lock to allow for concurrent access from across worker threads.

Additionally, this PR now tracks the inflight `Futures` across all executions globally in the PyRunner singleton. This is because there are instances where a single execution cannot make forward progress on its own (e.g. there are only 8 CPUs available, and there are 8 other currently-executing partitions). In this case, we need to wait for **some** execution globally to complete before attempting to make forward progress on the current execution.

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
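The accounting scheme described above can be sketched roughly as follows. This is a minimal illustration, not the PyRunner's actual code: the class and method names (`ResourceAccountant`, `try_admit`, `release`) and the two-resource model are invented for the example; the real runner also tracks GPUs and inflight `Futures` per the description.

```python
import threading
from concurrent.futures import Future


class ResourceAccountant:
    """Hypothetical sketch of lock-protected resource accounting
    shared by worker threads across all executions."""

    def __init__(self, total_cpus: int, total_memory: int):
        self._lock = threading.Lock()
        self._total_cpus = total_cpus
        self._total_memory = total_memory
        self._cpus_in_use = 0
        self._memory_in_use = 0
        # Inflight futures are tracked globally (across executions),
        # so any execution can wait for *some* task to complete.
        self._inflight: set[Future] = set()

    def try_admit(self, cpus: int, memory: int) -> bool:
        # Atomically check-and-reserve, so two worker threads cannot
        # both admit tasks past the machine's limits.
        with self._lock:
            if (self._cpus_in_use + cpus <= self._total_cpus
                    and self._memory_in_use + memory <= self._total_memory):
                self._cpus_in_use += cpus
                self._memory_in_use += memory
                return True
            return False

    def release(self, cpus: int, memory: int) -> None:
        # Called as a task completes (not when its future is retrieved),
        # so resources are freed as early as possible.
        with self._lock:
            self._cpus_in_use -= cpus
            self._memory_in_use -= memory
```

The key design point is that check-and-reserve happens under one lock acquisition: checking availability and then reserving in two separate steps would race between threads.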
Helps close part of #2561
This PR enables buffering of result partition tasks, preventing "runaway execution" when multiple executions run concurrently.
The problem previously was that if we ran two executions in parallel (`e1` and `e2`) on a machine with 8 CPUs:

- `e1` could potentially run 8 tasks and keep them buffered (not releasing the resource request)
- When `e2` attempts to run the next task, it notices that the task cannot be admitted on the system (due to memory constraints)
- `e2` thinks that it is deadlocking, because there is a strong assumption in the pyrunner today that if a task cannot be admitted, it merely has to wait for some other tasks in the same execution to finish up
- `e2` doesn't have any tasks currently pending (because it is starved); the pending tasks are all buffered in `e1`. Thus it thinks that it is deadlocking.

Solution:

- Buffer result partition tasks, with a default buffer size of `1`, instead of allowing each execution to run as many tasks as it wants
- Yield `None` to indicate that the plan is unable to proceed

Note that there is still potentially a problem here, e.g. running more than NUM_CPU executions concurrently. That can be solved in a follow-up PR that refactors the way we do resource accounting.
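The two solution points above can be sketched as a generator. This is an illustrative stand-in, not the actual plan iterator: the function name `buffered_plan_iter`, the `can_admit` callback, and the eager `task_results` iterable are assumptions made for the example.

```python
def buffered_plan_iter(task_results, can_admit, buffer_size=1):
    """Hypothetical sketch: hold at most `buffer_size` completed results
    before the consumer must drain them, and yield None (rather than
    blocking) when the next task cannot be admitted, so the caller can
    wait on *global* progress instead of assuming a task from this same
    execution will free resources."""
    buffer = []
    for result in task_results:
        # Drain the buffer before exceeding its size, so one execution
        # cannot hoard resources behind buffered results.
        while len(buffer) >= buffer_size:
            yield buffer.pop(0)
        if not can_admit():
            # Signal "unable to proceed" to the caller instead of
            # spinning or falsely concluding we are deadlocked.
            yield None
        buffer.append(result)
    # Flush whatever remains once all tasks have been produced.
    yield from buffer
```

With `buffer_size=1` (the default described above), each execution releases a result before starting the next task, which is what prevents `e1` from buffering 8 tasks while `e2` starves.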